Advanced data manipulation techniques

Stat405
Advanced data manipulation

Hadley Wickham
Tuesday, 28 September 2010

1. Baby names data
2. Slicing and dicing revision
3. Merging data
4. Group-wise operations


Baby names
Top 1000 male and female baby
names in the US, from 1880 to
2008.
258,000 records (1000 * 2 * 129)
But only five variables: year,
name, soundex, sex and prop.

CC BY http://www.flickr.com/photos/the_light_show/2586781132

Getting started
library(plyr)
library(ggplot2)

options(stringsAsFactors = FALSE)
# Can read compressed files
bnames <- read.csv("baby-names2.csv.bz2")

# Can read files from website
births <- read.csv(
"http://had.co.nz/stat405/data/births.csv")

# Unfortunately can't do both at the same time :(


> head(bnames, 20)
    year  name     soundex  prop      sex
1   1880  John     J500     0.081541  boy
2   1880  William  W450     0.080511  boy
3   1880  James    J520     0.050057  boy
4   1880  Charles  C642     0.045167  boy
5   1880  George   G620     0.043292  boy
6   1880  Frank    F652     0.027380  boy
7   1880  Joseph   J210     0.022229  boy
8   1880  Thomas   T520     0.021401  boy
9   1880  Henry    H560     0.020641  boy
10  1880  Robert   R163     0.020404  boy
11  1880  Edward   E363     0.019965  boy
12  1880  Harry    H600     0.018175  boy
13  1880  Walter   W436     0.014822  boy
14  1880  Arthur   A636     0.013504  boy
15  1880  Fred     F630     0.013251  boy
16  1880  Albert   A416     0.012609  boy
17  1880  Samuel   S540     0.008648  boy
18  1880  David    D130     0.007339  boy
19  1880  Louis    L200     0.006993  boy
20  1880  Joe      J000     0.006174  boy

> tail(bnames, 20)
        year  name      soundex  prop      sex
257981  2008  Miya      M000     0.000130  girl
257982  2008  Rory      R600     0.000130  girl
257983  2008  Desirae   D260     0.000130  girl
257984  2008  Kianna    K500     0.000130  girl
257985  2008  Laurel    L640     0.000130  girl
257986  2008  Neveah    N100     0.000130  girl
257987  2008  Amaris    A562     0.000129  girl
257988  2008  Hadassah  H320     0.000129  girl
257989  2008  Dania     D500     0.000129  girl
257990  2008  Hailie    H400     0.000129  girl
257991  2008  Jamiya    J500     0.000129  girl
257992  2008  Kathy     K300     0.000129  girl
257993  2008  Laylah    L400     0.000129  girl
257994  2008  Riya      R000     0.000129  girl
257995  2008  Diya      D000     0.000128  girl
257996  2008  Carleigh  C642     0.000128  girl
257997  2008  Iyana     I500     0.000128  girl
257998  2008  Kenley    K540     0.000127  girl
257999  2008  Sloane    S450     0.000127  girl
258000  2008  Elianna   E450     0.000127  girl


Your turn

Extract your name from the dataset. Plot
the trend over time.
What geom should you use? Do you
need any extra aesthetics?


hadley <- subset(bnames, name == "Hadley")

qplot(year, prop, data = hadley, colour = sex,
  geom = "line")
# :(


Your turn

Use the soundex variable to extract all
names that sound like yours. Plot the
trend over time.
Do you have any difficulties? Think about
grouping.


gabi <- subset(bnames, soundex == "G164")
qplot(year, prop, data = gabi)
qplot(year, prop, data = gabi, geom = "line")

qplot(year, prop, data = gabi, geom = "line",
  colour = sex) + facet_wrap(~ name)

qplot(year, prop, data = gabi, geom = "line",
  colour = sex, group = interaction(sex, name))


Sawtooth appearance
implies grouping is incorrect.
[Plot: prop against year, 1880 to 2000, lines coloured by sex (boy, girl)]
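
To see why the sawtooth appears, it can help to count the groups that each plot specification implies. The sketch below uses a small invented data frame (the names share soundex G164, but the years and proportions are made up for illustration) and base R's interaction() to show that grouping by sex alone pools two different names into one line.

```r
# Toy data: invented proportions, for illustration only
toy <- data.frame(
  year = rep(c(2000, 2001), times = 3),
  name = rep(c("Gabriel", "Gabriel", "Gabrielle"), each = 2),
  sex  = rep(c("boy", "girl", "girl"), each = 2),
  prop = c(0.004, 0.005, 0.0001, 0.0002, 0.002, 0.001),
  stringsAsFactors = FALSE
)

# Grouping by sex alone: the "girl" line mixes Gabriel and Gabrielle,
# so it jumps between two series within every year
groups_by_sex <- length(unique(toy$sex))

# Grouping by sex and name: one line per combination present in the data
groups <- interaction(toy$sex, toy$name, drop = TRUE)
groups_by_both <- nlevels(groups)

groups_by_sex
groups_by_both
table(groups)
```

Each level of the interaction has one row per year, which is what a line geom needs to draw a clean trend.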

Slicing and dicing

Function Package
subset base
summarise plyr
transform base
arrange plyr

They all have similar syntax. The first argument
is a data frame, and all other arguments are
interpreted in the context of that data frame.
Each returns a data frame.


color  value         color  value
blue   1             blue   1
black  2             blue   3
blue   3             blue   4
blue   4
black  5

subset(df, color == "blue")


color  value         color  value  double
blue   1             blue   1      2
black  2             black  2      4
blue   3             blue   3      6
blue   4             blue   4      8
black  5             black  5      10

transform(df, double = 2 * value)


color  value         double
blue   1             2
black  2             4
blue   3             6
blue   4             8
black  5             10

summarise(df, double = 2 * value)


color  value         total
blue   1             15
black  2
blue   3
blue   4
black  5

summarise(df, total = sum(value))


color  value         color  value
4      1             1      2
1      2             2      5
5      3             3      4
3      4             4      1
2      5             5      3

arrange(df, color)


color  value         color  value
4      1             5      3
1      2             4      1
5      3             3      4
3      4             2      5
2      5             1      2

arrange(df, desc(color))
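
The four functions can be tried together on the toy data frame from the colour/value slides. This is a minimal sketch: the data frame is rebuilt by hand from the slide, the result names (blue, doubled, only_double, total, sorted) are just illustrative, and the last call sorts on value rather than color.

```r
library(plyr)  # provides summarise(), arrange() and desc()

# Toy data frame copied from the colour/value slides
df <- data.frame(
  color = c("blue", "black", "blue", "blue", "black"),
  value = 1:5,
  stringsAsFactors = FALSE
)

blue        <- subset(df, color == "blue")        # keep matching rows
doubled     <- transform(df, double = 2 * value)  # add a column, keep the rest
only_double <- summarise(df, double = 2 * value)  # keep only the new column
total       <- summarise(df, total = sum(value))  # collapse to a single row
sorted      <- arrange(df, desc(value))           # reorder rows, largest first

blue
doubled
only_double
total
sorted
```

Note the difference between transform() and summarise(): both evaluate the expression inside df, but only transform() keeps the existing columns.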


Your turn

Calculate the total, largest and smallest
proportions.
Reorder the data frame containing your
name from highest to lowest popularity.


summarise(bnames,
  total = sum(prop),
  largest = max(prop),
  smallest = min(prop))

arrange(hadley, desc(prop))


Brainstorm

Thinking about the data, what are some
of the trends that you might want to
explore? What additional variables would
you need to create? What other data
sources might you want to use?
Pair up and brainstorm for 2 minutes.


External            Internal

Biblical names      First/last letter
Hurricanes          Length
Ethnicity           Vowels
Famous people       Rank
                    Sounds-like

join                ddply

Merging data

Combining datasets

Name    instrument       Name    band
John    guitar           John    T
Paul    bass             Paul    T
George  guitar      +    George  T       =  ?
Ringo   drums            Ringo   T
Stuart  bass             Brian   F
Pete    drums


x                        y                    Result
Name    instrument       Name    band         Name    instrument  band
John    guitar           John    T            John    guitar      T
Paul    bass             Paul    T            Paul    bass        T
George  guitar      +    George  T       =    George  guitar      T
Ringo   drums            Ringo   T            Ringo   drums       T
Stuart  bass             Brian   F            Stuart  bass        NA
Pete    drums                                 Pete    drums       NA

join(x, y, type = "left")


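x                        y                    Result
Name    instrument       Name    band         Name    instrument  band
John    guitar           John    T            John    guitar      T
Paul    bass             Paul    T            Paul    bass        T
George  guitar      +    George  T       =    George  guitar      T
Ringo   drums            Ringo   T            Ringo   drums       T
Stuart  bass             Brian   F            Brian   NA          F
Pete    drums

join(x, y, type = "right")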


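x                        y                    Result
Name    instrument       Name    band         Name    instrument  band
John    guitar           John    T            John    guitar      T
Paul    bass             Paul    T            Paul    bass        T
George  guitar      +    George  T       =    George  guitar      T
Ringo   drums            Ringo   T            Ringo   drums       T
Stuart  bass             Brian   F
Pete    drums

join(x, y, type = "inner")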


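x                        y                    Result
Name    instrument       Name    band         Name    instrument  band
John    guitar           John    T            John    guitar      T
Paul    bass             Paul    T            Paul    bass        T
George  guitar      +    George  T       =    George  guitar      T
Ringo   drums            Ringo   T            Ringo   drums       T
Stuart  bass             Brian   F            Stuart  bass        NA
Pete    drums                                 Pete    drums       NA
                                              Brian   NA          F

join(x, y, type = "full")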


Type      Action

"left"    Include all of x, and matching rows of y
"right"   Include all of y, and matching rows of x
"inner"   Include only rows in both x and y
"full"    Include all rows
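
The four join types can be compared by rebuilding the two tables from the slides and counting the rows each join returns. A minimal sketch, with some assumptions: the column is called name in lower case, band is stored as a logical rather than the T/F shown on the slides, and the key is given explicitly with by = "name".

```r
library(plyr)

# x and y rebuilt by hand from the slides
x <- data.frame(
  name = c("John", "Paul", "George", "Ringo", "Stuart", "Pete"),
  instrument = c("guitar", "bass", "guitar", "drums", "bass", "drums"),
  stringsAsFactors = FALSE
)
y <- data.frame(
  name = c("John", "Paul", "George", "Ringo", "Brian"),
  band = c(TRUE, TRUE, TRUE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

left  <- join(x, y, by = "name", type = "left")   # all of x
right <- join(x, y, by = "name", type = "right")  # all of y
inner <- join(x, y, by = "name", type = "inner")  # only names in both
full  <- join(x, y, by = "name", type = "full")   # every name

# Row counts per join type
sapply(list(left = left, right = right, inner = inner, full = full), nrow)
```

Rows without a partner are filled with NA, so the left join has NA in band for Stuart and Pete, and the full join also has NA in instrument for Brian.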


Your turn

Convert from proportions to absolute
numbers by combining bnames with births,
and then performing the appropriate
calculation.


bnames2 <- join(bnames, births,
  by = c("year", "sex"))
tail(bnames2)

bnames2 <- transform(bnames2, n = prop * births)
tail(bnames2)

bnames2 <- transform(bnames2,
  n = round(prop * births))
tail(bnames2)
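
The same two steps can be checked on a tiny example where the arithmetic is easy to follow by eye. In this sketch the two name rows are copied from head(bnames), but the births totals are invented placeholders (the real values live in births.csv), and names_toy, births_toy and joined are illustrative names.

```r
library(plyr)

# Two rows copied from head(bnames)
names_toy <- data.frame(
  year = c(1880, 1880),
  name = c("John", "William"),
  prop = c(0.081541, 0.080511),
  sex  = c("boy", "boy"),
  stringsAsFactors = FALSE
)

# Hypothetical totals, NOT the real birth counts
births_toy <- data.frame(
  year   = c(1880, 1880),
  sex    = c("boy", "girl"),
  births = c(100000, 90000),
  stringsAsFactors = FALSE
)

# Match on both keys, then convert proportions to whole-number counts
joined <- join(names_toy, births_toy, by = c("year", "sex"))
joined <- transform(joined, n = round(prop * births))
joined
```

Joining on both year and sex matters: joining on year alone would pair each name with the totals for both sexes.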


[Plot: births against year, 1880 to 2000, lines coloured by sex (boy, girl). Annotations: "1936: first issued" and "1986: needed for child tax deduction".]

Group-wise operations


Number of people

How do we compute the number of
people with each name over all years? It’s
pretty easy if you have a single name.
How would you do it?


hadley <- subset(bnames2, name == "Hadley")
sum(hadley$n)

# Or
summarise(hadley, n = sum(n))

# But how could we do this for every name?


# Split
pieces <- split(bnames2, bnames2$name)

# Apply
results <- vector("list", length(pieces))
for (i in seq_along(pieces)) {
  piece <- pieces[[i]]
  # Keep the name so each result row can be identified
  results[[i]] <- summarise(piece, name = name[1], n = sum(n))
}

# Combine
result <- do.call("rbind", results)


# Or equivalently

counts <- ddply(bnames2, "name", summarise,
n = sum(n))


# Or equivalently

counts <- ddply(bnames2, "name", summarise,
  n = sum(n))

bnames2       Input data
"name"        Way to split up input
summarise     Function to apply to each piece
n = sum(n)    2nd argument to summarise()


x y

a 2
a 4
b 0
b 5
c 5
c 10


           Split
x  y       x  y
a  2       a  2
a  4       a  4
b  0
b  5       x  y
c  5       b  0
c  10      b  5

           x  y
           c  5
           c  10

           Split        Apply
x  y       x  y
a  2       a  2         3
a  4       a  4
b  0
b  5       x  y
c  5       b  0         2.5
c  10      b  5

           x  y
           c  5         7.5
           c  10

           Split        Apply        Combine
x  y       x  y
a  2       a  2         3
a  4       a  4
b  0                                 x  y
b  5       x  y                      a  3
c  5       b  0         2.5          b  2.5
c  10      b  5                      c  7.5

           x  y
           c  5         7.5
           c  10
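
The diagram can be reproduced step by step in base R. This is a sketch on the six-row x/y table from the slides, using the mean of y as the function applied to each piece (which is what the 3, 2.5 and 7.5 in the diagram correspond to); the variable names are illustrative.

```r
# Data from the split-apply-combine slides
df <- data.frame(
  x = c("a", "a", "b", "b", "c", "c"),
  y = c(2, 4, 0, 5, 5, 10),
  stringsAsFactors = FALSE
)

# Split: one data frame per value of x
pieces <- split(df, df$x)

# Apply: reduce each piece to a single row holding its mean
means <- lapply(pieces, function(piece) {
  data.frame(x = piece$x[1], y = mean(piece$y), stringsAsFactors = FALSE)
})

# Combine: stack the per-group rows back into one data frame
result <- do.call("rbind", means)
rownames(result) <- NULL
result
```

With plyr loaded, ddply(df, "x", summarise, y = mean(y)) should give the same three rows in a single call.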

Your turn

Repeat the same operation, but use
soundex instead of name. What is the
most common sound? What name does
it correspond to?


scounts <- ddply(bnames2, "soundex", summarise,
  n = sum(n))
scounts <- arrange(scounts, desc(n))

# Combine with names
# When there are multiple possible matches,
# match = "first" makes join pick the first
# (the default, match = "all", returns every match)
scounts <- join(
  scounts, bnames2[, c("soundex", "name")],
  by = "soundex", match = "first")
head(scounts, 100)

subset(bnames, soundex == "L600")


# Alternative approach that you'll learn more
# about on Thursday

library(stringr)
scounts <- ddply(bnames2, "soundex", summarise,
  n = sum(n),
  names = str_c(sort(unique(name)), collapse = ","))
scounts <- arrange(scounts, desc(n))

